Abstract

Background: Fused genes are important sources of data for studies of evolution and protein function. To date no 
service has been made available online to aid in the large-scale identification of fused genes in sequenced 
genomes. We have developed a program, Gene deFuser, that analyzes uploaded protein sequence files for 
characteristics of gene fusion events and presents the results in a convenient web interface. 

Results: To test the ability of this software to detect fusions on a genome-wide scale, we analyzed the 24,725 
gene models predicted for the ciliated protozoan Tetrahymena thermophila. Gene deFuser detected members of 
eight of the nine families of gene fusions known or predicted in this species and identified nineteen new families 
of fused genes, each containing between one and twelve members. In addition to these genuine fusions, Gene 
deFuser also detected a particular type of gene misannotation, in which two independent genes were predicted as 
a single transcript by gene annotation tools. Twenty-nine of the artifacts detected by Gene deFuser in the initial 
annotation have been corrected in subsequent versions, with a total of 25 annotation artifacts (about 1/3 of the 
total fusions identified) remaining in the most recent annotation. 

Conclusions: The newly identified Tetrahymena fusions belong to classes of genes involved in processes such as 
phospholipid synthesis, nuclear export, and surface antigen generation. These results highlight the potential of 
Gene deFuser to reveal a large number of novel fused genes in evolutionarily isolated organisms. Gene deFuser 
may also prove useful as an ancillary tool for detecting fusion artifacts during gene model annotation. 



Background 

Fusion genes, also known as chimeric genes, are formed 
when the reading frames of two or more distinct genes 
are joined together by recombination events such as 
unequal crossing over, transposition, and deletion [1]. 
After the fusion, the new gene codes for a single, novel 
protein that is a hybrid of the two separate proteins, 
where each part performs a discrete function and has an 
independent evolutionary history. Although very few of 
these recombination events produce proteins that retain 
their proper function or expression pattern, on occasion 
the constituent genes do combine to form a new, working
gene that can be passed on to offspring [2]. Generation
of new multidomain proteins by gene fusions is a
major mechanism by which functional complexity has 
evolved in multicellular eukaryotes [1,2], and many key
proteins currently under research, including Hedgehog 
[3], Type II Topoisomerase, and RNA Polymerase [4], 
began as fusions of genes in the ancestors of eukaryotes. 

Successful fusion requires that both halves of the new 
gene function properly despite the loss of expression 
elements from the downstream gene, which falls under 
control of the upstream promoter. Therefore, only 
fusions in which the two linked proteins can function in 
the same compartment of the cell, at the same develop- 
mental stage, and in response to the same stimuli will 
be tolerated. While it has been hypothesized that two 
genes with unrelated functions may merge and be 
retained in the genome [4,5], almost all bifunctional 
fusion genes seen to date show a functional relationship 
between the proteins that comprise the fusion. Related 
genes are more likely to result in a functional fusion 
gene, and may even confer a selective advantage to the 
organism in some cases. Most fused gene pairs have
orthologs that are part of the same metabolic pathway, 
are involved in the same protein complex [6], or regu- 
late one another's activity [5]. A selective advantage may 
emerge if the fused protein leads to a greater catalytic 
activity or more efficient co-regulation than is possible 
for the two independent proteins. 

Given these complex requirements, gene fusions are 
rarely successful, and few examples exist of analogous 
recombinations occurring in multiple unrelated taxa by 
convergent evolution [2,7]. These requirements also 
guarantee that the split of a fusion gene into its two 
component proteins is much rarer than the original 
fusion events. Studies have estimated that gene fusion is 
approximately four times more common than gene fis- 
sion events, in which a single gene splits into multiple, 
smaller coding segments [8]. The predominance of gene 
fusions over gene fissions is expected in part because 
gene fusions can result in the potentially favorable cou- 
pling of proteins with related biological functions, rather 
than the unfavorable separation of proteins whose 
shapes and functions have evolved together over time 
[6]. Additionally, gene fusion involves the loss of the ter- 
mini of the genes being fused, a much simpler process 
than fission, which requires that the genes somehow 
obtain a promoter, terminator, start codon, and stop 
codon when the gene splits. 

The scarce and persistent nature of gene fusions 
makes them ideal macromolecular markers of evolu- 
tion and, like insertions, deletions, and other genomic 
rearrangements, they have long served as data for phy- 
logenetic analysis. The usefulness of gene fusions in 
studies of this type was featured in 2003 when, follow- 
ing the attempts of many different research groups to 
locate the root of the eukaryote tree by a variety of 
methods, the presence of a fusion between dihydrofo- 
late reductase (DHFR) and thymidylate synthase (TS) 
in plants and many protozoan species, but not in ani- 
mals and fungi, supported rooting of the eukaryotic 
tree between these groups [9,10]. Though gene losses 
and horizontal gene transfer have complicated the con- 
clusions that can be reached from these single-charac- 
ter analyses [11,12], gene fusions may still provide 
some of the most reliable information about the dee- 
pest branching taxa. 

In addition to their usefulness in phylogenetic studies, 
gene fusions can also serve as Rosetta Stone proteins 
that provide information about their constituent genes. 
Since the fused proteins are likely to be functionally 
related, characterization of each constituent gene 
informs researchers about their homologs in other gen- 
omes [4,13]. In the majority of cases where annotation 
of the function of fusion proteins in eukaryotes and prokaryotes
is available, the constituent proteins are involved in core
metabolism, which may help researchers understand both simple
and more complex biological metabolic systems [13]. In particular,
fusion proteins
in eukaryote genomes have been used to identify hidden 
protein-protein interactions [13]. 

Despite their important uses in evolutionary studies as 
powerful phylogenetic markers, and in functional studies 
as windows into biochemical pathways and protein 
interactions, few of the fusion genes present in eukar- 
yotes have been identified and studied in depth. 
Researchers have previously created programs to find 
fusion genes in specific genomes [14,15]. However, to 
date no large-scale service has been made available to 
the public to aid in the identification of fusions in large, 
genome-sized data sets. Here we present a new bioinfor- 
matics tool, Gene deFuser, which we have developed for 
this purpose. The underlying algorithm compares 
BLAST results from the beginning and end of protein 
sequences submitted through an online interface. Puta- 
tive gene fusions are displayed for the user in a conveni- 
ent interface that simplifies further analysis of the 
candidate genes. Gene deFuser is based on programs we 
have used previously to identify gene fusions in the for- 
maldehyde detoxification pathways of ciliates and dia- 
toms [16] and in the methionine salvage pathway of 
Tetrahymena [17]. To highlight the value of this service, 
we present an in-depth survey of the results obtained
for the predicted proteome of Tetrahymena thermo- 
phila, which includes the identification of several new 
types of fusion genes. 

During this survey we also identified a large number 
of misannotated gene models, which can be attributed
to a common artifact of gene prediction software in 
which two genes are merged into a single transcript. 
Comparison of Gene deFuser results for the first and 
final versions of the Tetrahymena genome showed that 
about half of the artifacts found in the initial scan of the 
genome were corrected over time. Gene deFuser may 
serve as a useful tool to speed the identification of these 
types of artifacts in future genome projects. 

Methods 

The Gene deFuser program utilizes BLAST [18] to 
detect similarities between the two ends of a protein 
and the sequences in a database of orthologous protein 
groups. The program compares these sequences to the 
KOG (eukaryotic orthologous groups) database [19], 
which is a subset of the COG (clusters of orthologous 
genes) database [20] containing groups of orthologous 
genes for seven eukaryotic genomes. Although newer 
and more complete ortholog databases exist, we chose 
the KOG database because it was extensively curated 
and the authors specifically broke down fused genes 
into their component KOG domains [19,20]. This 
allowed us to identify genes such as DHFR-TS (DTS1 in
Tetrahymena [16]; see Results) that would have other- 
wise been masked by their presence in one or more of 
the species represented in the KOG database. 

Gene deFuser generates a list of KOG identifiers for 
each end of the protein in question based on the 
BLAST results. The list of identifiers found for the N- 
terminus is then compared with those listed for the C- 
terminus. A typical non-fused protein will match the 
same KOG at both the N-terminus and the C-terminus. 
A protein that returns a matching KOG identifier at 
both ends is presumed to be non-fused and is excluded 
as a possible fusion gene. Proteins that do not share any 
KOG hit at both ends are presented in a list of candi- 
date fusion proteins. This method obviously omits 
fusions that were missed during curation of the KOG 
database. However, any fusion genes missed due to this 
limitation are present in at least several of the model 
organisms used to generate the KOG database and, 
because these genomes are highly studied, these fusions 
are likely to have been described already. The main 
application of the Gene deFuser program is to identify 
novel fusion genes. 

An outline of the methodology used by Gene deFuser 
to identify fused genes is shown in Figure 1. Gene deFu- 
ser accepts as input multiple protein sequences in 
FASTA format and can be used to search files that 
cover the size of a typical genome (~30,000 proteins).
After the user submits a set of proteins, the program 
extracts a portion of the C-terminus and a portion of 
the N-terminus of each sequence to use as queries in 
BLAST searches. The fraction of the protein used for 
BLAST searches of the C- and N-terminus can be 
adjusted by the user, but the default is set at 30%. Using 
too much of the protein as a query can lead to overlap 
in the KOG hits on both ends and prevent the identifi- 
cation of fused genes; using too little of the protein can 
result in poor BLAST scores. This parameter must be 
set to less than 50% of the sequence to avoid overlap of 
the segments, and after experimenting with different 
values between 20% and 40%, we settled on using the 
first 30% of the protein sequence as the N-terminus 
query and the final 30% of the protein as the C-termi- 
nus query in our analysis of the Tetrahymena genome. 
The default value of 30% brought back 52 sequences 
that we believe are genuine fused genes. When we per- 
formed the search using 20%, the program only detected 
19 of these 52 genes. When we increased this parameter
to 40%, the program appeared to detect a few
additional fused genes; however, it missed 7 of the 52 
fusions detected using 30% and returned more false 
positives. Based on these observations, the users are 
encouraged to repeat their searches using different 
values of this parameter. 
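
The extraction of the two terminal queries described above can be sketched as follows. This is a minimal illustration, not the Gene deFuser source: the function names `read_fasta` and `terminal_fragments` and the rounding rule are assumptions introduced here, and only the 30% default and the requirement that the fraction stay below 50% come from the text.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def terminal_fragments(sequence, fraction=0.30):
    """Return (N-terminal, C-terminal) query fragments of a protein sequence."""
    if not 0 < fraction < 0.5:
        raise ValueError("fraction must be above 0 and below 0.5 to avoid overlap")
    # Assumed rounding rule: nearest residue count, at least one residue.
    length = max(1, int(round(len(sequence) * fraction)))
    return sequence[:length], sequence[-length:]


if __name__ == "__main__":
    example = [">toy_protein", "MKTAYIAKQR", "QISFVKSHFS"]
    for name, seq in read_fasta(example):
        n_term, c_term = terminal_fragments(seq)
        print(name, n_term, c_term)
```

Rerunning such a step with fractions of 0.20 and 0.40 would correspond to the repeated searches recommended above.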
Figure 1 Outline of the Gene deFuser algorithm



After generating files containing the N- and C-termini 
of the proteins, the sequences are used to search the 
KOG database using BLASTP. We downloaded the 
KOG database on July 10, 2010 and modified the dataset 
by eliminating all protein sequences not assigned to a 
KOG. About 54% (59,838) of the 110,655 gene products 
analyzed to create the KOG database are included in 
4,852 clusters of orthologs, while the remaining genes
are not assigned to an ortholog group. The BLAST 
search results for the N- and C-termini that exceed a 
user-defined threshold (default: e-value < 1e-10) are
parsed to determine the KOG of each hit, and then a 
combined score for each KOG is calculated using the 
methodology described in Zhou and Landweber [21]. 
Using this methodology, each KOG with a significant 
hit is assigned a score given by the following formula: 



where N is the number of sequences belonging to the
KOG group and P_i = 1 - exp(-E_i), where E_i is the e-value
of the BLAST hit to a given sequence. For sequences without
significant BLAST hits, that is, with an e-value larger
than the e-value threshold, P_i = 1.
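
The conversion from e-values to P_i values can be sketched as below. The conversion itself uses the standard BLAST relationship P = 1 - exp(-E) and the threshold rule stated above. The way the P_i values are combined into one score per KOG (here, the mean of -ln P_i over the N members of the group) is only an assumption made for illustration and should be checked against Zhou and Landweber [21]; the function names and the numerical floor are likewise hypothetical.

```python
import math


def hit_probability(e_value, threshold=1e-10):
    """Return P_i = 1 - exp(-E_i), or 1.0 when the hit is not significant."""
    if e_value is None or e_value >= threshold:
        return 1.0
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def kog_score(e_values, group_size, threshold=1e-10, floor=1e-300):
    """ASSUMED aggregate score: mean of -ln(P_i) over all N sequences in a KOG.

    e_values holds the e-values of the hits to members of one KOG;
    group_size is N, the total number of sequences in that KOG.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    total = 0.0
    for e_value in e_values:
        # The floor guards against log(0) when BLAST reports an e-value of 0.
        p = max(hit_probability(e_value, threshold), floor)
        total += -math.log(p)
    # Members without a significant hit have P_i = 1 and contribute 0.
    return total / group_size
```

Under this assumed form, a KOG in which many members hit the query with very small e-values scores high, while a KOG with a single weak hit among many members scores close to zero.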

KOGs that score higher than a cutoff threshold set by 
the user (default = 5) are used in the second part of the 
analysis, which compares the KOGs that hit the N-ter- 
minus to those that hit the C-terminus of each protein. 
If both ends are hit by at least one KOG, and the KOGs 
that hit the N-terminus are different from those that hit 
the C-terminus, the protein is deemed a candidate 
fusion. We further divide the candidate fusion proteins 
into two categories: those that have a single KOG hit to 
each end and those that have multiple KOG hits to at 
least one end. 
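
The comparison of KOG hits at the two ends can be sketched as a set operation. This is an illustrative reading of the rule above rather than the program's actual code; the function name and the returned labels are assumptions, while the KOG identifiers in the usage example are those reported for FSF1 in the Results.

```python
def classify_protein(n_kogs, c_kogs):
    """Classify a protein from the KOGs hitting its N- and C-terminal queries."""
    n_kogs, c_kogs = set(n_kogs), set(c_kogs)
    if not n_kogs or not c_kogs:
        # At least one end has no KOG above the score cutoff.
        return "no call"
    if n_kogs & c_kogs:
        # A shared KOG at both ends is taken as evidence of a non-fused protein.
        return "non-fused"
    if len(n_kogs) == 1 and len(c_kogs) == 1:
        return "candidate fusion (single KOG hit at each end)"
    return "candidate fusion (multiple KOG hits at one or both ends)"


if __name__ == "__main__":
    # FSF1: ADH classes III and V at the N-terminus, Esterase D at the C-terminus.
    print(classify_protein({"KOG0022", "KOG0023"}, {"KOG3101"}))
```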

The program identifies all candidate fusions in the file 
and lists them on a web page. Each protein in this list is 
hyperlinked to a page that details the KOG hits at both 
the N-terminus and C-terminus and graphically displays 
the location of BLAST hits against the Uniprot and 
KOG databases. These results can then be examined by 
an expert to determine whether each candidate is a 
fused gene, a non-fused gene, or a sequencing or anno- 
tation error. 

Because each submission can take several hours to run 
after data are uploaded, the user is asked to submit an 
email address to be notified when the job is completed. 
When the program finishes its run, the job number is 
emailed to the user for retrieval at the Gene deFuser 
website. Gene deFuser is freely available online at: 
http://DNA.pomona.edu/deFuser/deFuser.html.

To test this program and service, we uploaded and 
analyzed the protein set predicted by The Institute for 
Genomic Research (TIGR; now J. Craig Venter Institute) 
for Tetrahymena thermophila strain SB210 [22]. The 
current protein annotation (v.2008) [23] was down- 
loaded from the TIGR website: 

ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/annotation_dbs/final_release_oct2008/ttal_oct2008_finalrelease.aa.fsa



The initial protein annotation (v.2004) [22] was also 
analyzed and the results were compared to the v.2008 
sequences. The v.2004 sequences were downloaded 
from the following location: 

ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/Gene_Predictions/Preliminary_Gene_Predictions_Aug_2004.pep

Results and discussion 

To test Gene deFuser's ability to detect fused genes, we 
used it to analyze the genome of Tetrahymena thermo- 
phila, a ciliated protozoan evolutionarily distant from 
the seven eukaryotic species used to populate the KOG 
database. We chose Tetrahymena in particular because 
of our familiarity with the biology of the organism, its 
detailed genome annotation history, and our interest in 
several of its previously described gene fusions [16,17]. 
In addition to the evolutionary gene fusions we expected 
to find with this tool, we also attempted to identify arti- 
ficial gene fusions created during the process of gene 
model annotation, by comparing the earliest round of 
gene predictions with the most recent round. 

Gene deFuser detected 80 candidate fusion genes in 
the final annotation (v.2008) of the Tetrahymena genome.
The raw results of these analyses can be accessed
at http://dna.pomona.edu/deFuser/Results/Final_Tet/Final_Tet.html
and are available as Additional file 1. Brief
descriptions of the known fusions in this genome and 
some of the more interesting new candidate fusions 
detected by Gene deFuser are listed below. Prior to this 
analysis we were aware of nine published families of 
fusion genes either known or predicted to be present in 
Tetrahymena (Table 1). Gene deFuser successfully iden- 
tified members of eight of these families and also 
revealed 19 additional families (52 new genes total) that 
appear to be genuine fusions. These results have been 
categorized and refined, and are presented in Table 1. 
The remaining 28 candidates either have too little simi- 
larity to sequences in the KOG or Uniprot database for 
us to make a valid judgment, or the architecture of the 
gene model (e.g. a large intron between the putatively 
fused domains) casts doubt on its legitimacy. It is 
important to keep in mind that the classification of the 
candidates into real fusion or false positives relies on 
the interpretation of available data, and that these 
fusions should be confirmed by experimental data if 
they prove to be of interest to the researcher. 

Known Tetrahymena Fusion Genes 

FSF1 (Genbank: EAR92957) is a gene fusion that con- 
tains a formaldehyde dehydrogenase (FALDH) domain 
at the N-terminus of the predicted protein and an S-formylglutathione
hydrolase (SFGH) domain at the C-terminus [16].
The initial Gene deFuser report shows that

Table 1 Fused genes detected by Gene deFuser in Tetrahymena

N-terminus hit | C-terminus hit | Copies found in the genome | Fusion gene described or predicted in Tetrahymena? | Name | Genbank Accession
FALDH | SFGH | 1 | YES | FSF1 | EAR92957
MTNB | MTND | 1 | YES | MBD1 | EAS04801
dihydrofolate reductase | thymidylate synthase | 1 | YES | DTS1 | EAR85731
P-type ATPase | adenylyl/guanylyl cyclase | 2 | YES | PAC1, PAC2 | EAS02708, EAS03660
cyclophilin peptidyl-prolyl cis-trans isomerase | SYF2 pre-mRNA splicing factor | 1 | YES | CSY1 | EAR98967
TBC1 domain GTPase activator | SEC7-family GTPase | 1 | YES | TBS1 | EDK31800
2-enoyl-CoA hydratase | peroxisomal multifunctional oxidation protein | 1 | YES | MFE1 | EAS01180
kelch repeat containing protein | ser/thr phosphatase | 2 | YES | BSU1, BSU2 | EAR82584, EAS02286
fatty acyl-CoA reductase | dihydroxyacetone phosphate acyltransferase | 1 | NO | ART1 | EAS00429
leishmanolysin-like peptidase | subtilisin-like proprotein convertase | 12 | NO | LSF1, LSF2, LSF3, LSF4, LSF5, LSF6, LSF7, LSF8, LSF9, LSF10, LSF11, LSF12 | EAR96678, EAR96679, EAR96681, EAR82776, EAR86010, EAR86011, EAR86012, EAR86013, EAR86016, EAR86017, EAR86018, EDK32083
ser/thr kinase | O-linked N-acetylglucosamine transferase | 4 | NO | KOF1, KOF2, KOF3, KOF4 | EAR98929, EAS07587, EAR94286, EAS05661
kinesin | ER-golgi vesicle tethering protein | 2 | NO | KEF1, KEF2 | EAR95984, EAR91273
myosin | Regulator of Chromosome Condensation (RCC1) | 3 | NO | MYO11, MYO12, MYO3 | EAR87392, EAR93163, EAR98568
kinesin CENP-E | Regulator of Chromosome Condensation (RCC1) | 2 | NO | KRC1, KRC2 | EAR84240, EAR88562
calmodulin dependent protein kinase | Radial spoke protein | 2 | NO | RSK1, RSK2 | EAR84708, EAR84712
MAPK ser/thr kinase | Radial spoke protein | 1 | NO | RSK3 | EAS01279
NIMA-related kinase | Radial spoke protein | 1 | NO | RSK4 | EAR95086
guanylate-binding protein | ER-golgi vesicle tethering protein | 1 | NO | GVF1 | EAR98751
ankyrin/histone H3 methyl transferase | exosome 3'-5' exoribonuclease | 1 | NO | AXE1 | EAR87370
ser/thr kinase | LRR-containing protein | 4 | NO | LRK1, LRK2, LRK3, LRK4 | EAR91534, EAR87255, EAR99973, EAR92811
protein phosphatase | ER-golgi vesicle tethering protein | 1 | NO | LRC1 | EAR89472
PI-4-phosphate 5-kinase | tyrosine kinase | 1 | NO | TKL1 | EAR94148
subtilisin-like proprotein | teneurin-1 | 2 | NO | CVP1, CVP2 | EAR94583, EAS03363
transcription factor NF-X1 | nuclear protein export factor | 1 | NO | ZEF1 | EAS01176
uncharacterized protein | 26S proteasome subunit | 1 | NO | PLF1 | EAS02650
ser/thr kinase | Ca2+/calmodulin protein kinase | 1 | NO | KFK1 | EAR81873
aarF domain containing protein | ubiquinone biosynthesis protein | 1 | NO | ABC1 | EAS05302

the N-terminal domain resembles alcohol dehydrogen- 
ase (ADH) Classes III and V (KOG0022 and KOG0023) 
while the C-terminus resembles Esterase D (KOG3101). 
Closer examination of these KOG hits and the list of 
similar proteins in the UniProt database shows that the 
two fused proteins function in the formaldehyde detoxi- 
fication pathway. When naming the gene, we high- 
lighted the common pathway in which these proteins 
function by choosing synonyms for ADH III/V 
(FALDH) and Esterase D (SFGH). Interestingly, this 
gene seems to also be fused in a distantly related group 
of protozoans, the diatoms, albeit in the reverse order, 
with the SFGH protein in the N-terminus and the 
FALDH protein in the C-terminus [16]. This feature 
shows that the two original proteins fused independently 
in the ciliate and diatom lineages. 

MBD1 (Genbank: EAS04801) is a fusion of two genes 
in the methionine salvage pathway, methylthioribulose- 
1-phosphate dehydratase (mtnB) and 1,2-dihydroxy-3-
keto-5-methylthiopentene dioxygenase (mtnD) [17]. 
This fusion seems to be unique to Tetrahymena and its 
closest relatives, as it is not present in the genome of 
the other fully sequenced ciliate Paramecium tetraurelia. 
Surprisingly, the Tetrahymena genome is lacking the 
enzyme that catalyses the intermediate step in the 
methionine salvage pathway between those of mtnB and 
mtnD, enolase-phosphatase E1 (mtnC). Complementation
tests in yeast mutants were used to show that the
fusion gene is able to catalyze the intermediate (mtnC) 
step of the pathway in addition to the two expected
reactions, indicating a gain of function as a result of the
fusion [17]. 

DTS1 (Genbank: EAR85731) is a fusion of dihydrofo- 
late reductase and thymidylate synthase, a well-known 
fusion found in bikont organisms (plants, most proto- 
zoan species) but absent in unikonts (animals, fungi, 
and amoebas) that was used to root the eukaryotic phy- 
logenetic tree [9]. Even though this gene is fused in Ara- 
bidopsis thaliana, one of the organisms used to create 
the KOG database, we were able to detect it thanks to 
the manual curation of the KOG database that broke 
down fused genes into their component domains. 

The proteins PAC1 and PAC2 (Genbank: EAS02708 
and EAS03660) each contain a P-type ATPase domain 
and an adenylyl/guanylyl cyclase domain. A fusion 
between these two genes was previously described in 
another Tetrahymena species, T. pyriformis [24]. In 
addition, the same fusion was shown to be present in
the ciliate Paramecium tetraurelia and in the apicom- 
plexan Plasmodium falciparum, suggesting that it parti- 
cipates in a shared form of signal transduction in these 
closely related species [24]. 

MFE1 (Genbank: EAS01180) is part of a well- 
described peroxisomal multifunctional enzyme family 
with homologs in all types of unikonts, but with few 
homologs among the bikonts. Only the alveolates show 
homologs of these proteins, most likely indicating inde- 
pendent origins for these fusions rather than multiple 
losses from many paraphyletic bikonts. Functional stu- 
dies have been performed on the Toxoplasma gondii 



Table 2 False positives detected by Gene deFuser

Annotation Version | Number of False Positives | Accession Numbers
Final Annotation (v.2008) | 28 | EAR92881, EAS01798, EAR92566, EAR82879, EAS00133, EAR84691, EAR92830, EAR99356, EAR91587, EAR84275, EAR84417, EAR99401, EAR83898, EAR89871, EAR85428, EAS03452, EAS02693, EAR83154, EAR82303, EAS00607, EAR83089, EAR85121, EAR89363, EAR91270, EAR96069, EAR96106, EAR86245, EAR86074
Initial Annotation (v.2004) (1) | 29 (in addition to the 28 that are still present in the Final Annotation) | EAR96923, EAR84622, EAR85248, EAR85505, EAR85282, EAR85413, EAR97343, EAR99583, EAS01392, EAS00371, EAS04594, EAR87314, EAS03022, EAS02070, EAS03869, EAR99890, EAR82527, EAR85830, EAS07404, EAR99312, EAR89578, EAR91857

(1) These genes were removed in the final version of the annotation. Note that even though 29 false positives were identified, only 22 accession numbers are
listed. The remaining 7 false positives were eliminated before the sequences were submitted to GenBank, and thus have no accession number.

version of the protein that demonstrate its involvement
in cholesterol uptake [25]. 

Two copies of a serine-threonine protein phosphatase 
with Kelch-like repeats (PPKLs), BSU1 (Genbank: 
EAR82584) and BSU2 (Genbank: EAS02286), are homo- 
logs of a suppressor of brassinolide receptor kinase 
mutations described in Arabidopsis [26]. Prior to the
extensive sequencing of protist and algal species, the
distribution of these genes was found to be limited to 
plants and apicomplexans [27]. Results from Gene deFu- 
ser led us to identify BSU1 and BSU2, and further inves- 
tigation led us to BSU3 (Genbank: EAR83784), another 
homolog with a variant Kelch-domain that prevented its 
identification by our program. Additional homologs 
were identified during our subsequent BLAST searches 
in other alveolates and in green algae. 

CSY1 (Genbank: EAR98967) is a fusion between a 
peptidyl prolyl isomerase (cyclophilin) and a homolog of 
the yeast RNA splicing factor SYF2. This gene and its 
ortholog in Paramecium have been identified previously 
as members of a family of genes found only in alveo- 
lates, with the exception of the green algal species 
Ostreococcus tauri [28]. The specific properties of this 
fusion have not yet been explored, but its merit as a 
drug target for alveolate parasites has been noted. TBS1 
(Genbank: EDK31800) is a small GTPase of the SEC7 
family fused to a TBC1-related GTPase activating protein.
Like the cyclophilin/SYF2 genes, fusions of these
two secretory pathway proteins are believed to comprise 
a family unique to alveolates [29]. 

While Gene deFuser was able to identify the eight 
types of fusion genes listed above, it did miss two genes 
that we expected it to find, TBS2 (Genbank: EAR85277) 
and CYC13 (Genbank: EAR91121). TBS2 is a paralog of 
the TBS1 gene described above. Although the program 
did not detect TBS2, it did return one fusion belonging 
to this family. The only unique gene fusion that we 
expected to find but was missed by the program was 
CYC13. The CYC13 fusion links a cyclin protein to a 
cyclin-dependent kinase (CDK) and was first observed 
in a screen of cell cycle-specific genes in the ciliate 
Eufolliculina uhligi [30]. BLAST searches of both TBS2 
and CYC13 show no similarity to known sequences for 
large portions of the N-terminus of each gene (32% of 
the TBS2 sequence and 35% of the CYC13 fusion). It is 
not clear whether these N-terminal sequences are 
indeed part of the actual proteins, but these extra 
sequences with no homology explain why neither pro- 
tein was identified by our program. Increasing the 
amount of the protein sequence used to BLAST the 
KOG database from its default value of 30% to 45% did 
not help in identifying these sequences. In the case of 
these proteins, a sliding window approach would likely 
overcome this limitation in the software, as the different
KOGs that hit these sequences do not overlap. Such a
methodology might be implemented in a future version 
of the program. 
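
One way such a sliding window approach could look is sketched below. It is a hypothetical design, not a description of any released version of Gene deFuser: the window and step sizes, the function names, and the KOG identifiers in the usage example are all assumptions. Each window would be used as a BLAST query, and a protein would become a candidate when two non-overlapping windows are hit by different KOGs.

```python
def sliding_windows(sequence, window=100, step=50):
    """Yield (start, end, fragment) tuples that cover the whole sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(sequence) - window, 0)
    start = 0
    while True:
        end = min(start + window, len(sequence))
        yield start, end, sequence[start:end]
        if start >= last_start:
            break
        # Clamp the final window so it ends exactly at the C-terminus.
        start = min(start + step, last_start)


def disjoint_window_pairs(window_kogs):
    """Return pairs of non-overlapping windows hit by entirely different KOGs.

    window_kogs is a list of (start, end, set_of_kog_ids) sorted by start.
    Windows with no KOG hit (for example, regions without homology) are skipped.
    """
    pairs = []
    for i, (s1, e1, k1) in enumerate(window_kogs):
        for s2, e2, k2 in window_kogs[i + 1:]:
            if e1 <= s2 and k1 and k2 and not (k1 & k2):
                pairs.append(((s1, e1), (s2, e2)))
    return pairs


if __name__ == "__main__":
    # Hypothetical KOG hits: the first window has no homology at all.
    hits = [(0, 100, set()), (50, 150, {"KOG_A"}),
            (100, 200, set()), (150, 250, {"KOG_B"})]
    print(disjoint_window_pairs(hits))
```

Because windows without hits are ignored rather than treated as one end of the protein, an N-terminal stretch with no homology, as seen in TBS2 and CYC13, would not prevent detection under this scheme.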

New Tetrahymena Fusion Genes 

One of the most useful applications of Gene deFuser, 
and the detection of fused genes in general, is to allow 
the generation of hypotheses that later can be tested 
experimentally. For example, the fusion MBD1 described 
in the previous section was first detected during the 
testing of an early version of this program. Based on the 
lack of mtnC in the Tetrahymena genome, we hypothe- 
sized that this fusion of mtnB and mtnD also catalyzes 
the mtnC reaction. We then successfully showed this to 
be the case using yeast deletion strains [17]. In addition 
to previously described fused genes such as this, Table 1 
lists several as yet uncharacterized fusions among the 80 
candidate fusions detected by the program. Here we 
describe some of the more interesting fusions found in 
this list. 

The first is a fusion between a long-chain fatty acyl- 
CoA reductase and dihydroxyacetone phosphate acyl- 
transferase (DHPAT) (Genbank: EAS00429) that we 
have called ART1. These two enzymes catalyze sequen- 
tial steps in the production of membrane phospholipids. 
Fusions of these genes are distributed in an odd pattern 
among several eukaryotic groups, suggesting either mul- 
tiple evolutionary gains or losses of this fusion. The 
fusion is present in the ciliates T. thermophila (Gen- 
bank: EAS00429) and P. tetraurelia (Genbank: 
XP_001433255), but not in other alveolates whose genomes
have been fully sequenced, such as the dinoflagellate
Perkinsus marinus or the apicomplexans
Plasmodium, Toxoplasma, Babesia or Cryptosporidium. 
The fusion is also present in the amoebozoans Dictyoste- 
lium discoideum (Genbank: XP_636393) and Polysphon- 
dylium pallidum (Genbank: EFA75040). Fusions of 
these genes are also found in one stramenopile, Phy- 
tophthora infestans (Genbank: XP_002902570), but not 
in other stramenopiles like Thalassiosira pseudonana or 
Phaeodactylum tricornutum. Likewise, a fusion is pre- 
sent in the excavate Naegleria gruberi (Genbank: 
XP_002683520), but not in other excavates like Giardia
intestinalis or Trichomonas vaginalis. 

Many of the remaining gene fusions detected in Tetra- 
hymena appear to belong to expanded gene families. 
With twelve copies present in the genome, the most 
common fusion detected was a protein formed by join- 
ing leishmanolysin and a subtilisin-like proprotein con- 
vertase (Genbank: EAR96678, EAR96679, EAR96681, 
EAR82776, EAR86010, EAR86011, EAR86012, 
EAR86013, EAR86016, EAR86017, EAR86018, 
EDK32083), both of which are peptidases [31] that loca- 
lize to the cell surface [32]. It has been noted that 
leishmanolysins constitute a greatly expanded protein
family in Tetrahymena [22], suggesting that protein pro- 
cessing at the cell surface may be particularly complex 
in ciliates. It is possible that the fusions identified here 
might simplify these types of reactions at the cell sur- 
face. Additionally, in mice, the genes that code for both 
these proteins are regulated by the protein Nrf2 and are 
co-regulated by the anti-tumor compound curcumin 
[31]. These connections further suggest that these pro- 
teins contribute to a common process and that the 
fusion may have some significance in Tetrahymena. 

Also present are four copies of a serine/threonine 
kinase fused with O-linked N-acetylglucosamine trans- 
ferase (Genbank: EAR98929, EAS07587, EAR94286, 
EAS05661). Serine/threonine kinases phosphorylate pro- 
teins on the hydroxyl group of specific serine or threo- 
nine residues [33], while O-linked N-acetylglucosamine 
transferases instead attach a single β-O-linked
N-acetylglucosamine to serine and threonine residues [34]. Since
these enzymes could compete for the same phosphoryla- 
tion/glycosylation sites, a fusion of the two catalytic 
domains might provide a simple way to regulate this 
competition. 

Several of the fusions present in Tetrahymena involve 
the motor proteins myosin and kinesin. A fusion of 
kinesin with an ER-Golgi vesicle tethering protein
(Genbank: EAR95984, EAR91273) might participate in
anterograde vesicle movement from the ER to the Golgi,
which is known to be mediated directly by kinesin [35]. 
Three fusions (Genbank: EAR87392, EAR93163, 
EAR98568) are found between myosin and RCC1, a 
nuclear Ran-GEF that promotes transport of cargo 
across the nuclear membrane [36]. Myosins have been 
found in the nucleus, and some types have been shown 
to localize specifically at the nuclear pore complex [37]. 
Thus, it is possible that the myosin-RCC1 fusions
identified are involved in nucleocytoplasmic transport. Two
kinesin-RCC1 fusions (Genbank: EAR84240, EAR88562),
on the other hand, might serve a different function. 
While the KOG hits in the Gene deFuser results do not 
specify the type of kinesin involved in the fusion, the 
results of the BLAST search against Uniprot show the 
best match is to part of Centromere Protein E (CENP- 
E), a kinetochore-associated kinesin. CENP-E has been 
implicated as a sensor that mediates the capture of 
microtubules at the kinetochore and relays this to the 
checkpoint machinery [38]. During mitosis RCC1 is 
responsible for the production of Ran-GTP, which is 
known to stimulate the release of checkpoint proteins 
from the kinetochores [38], thus overcoming the cell 
cycle checkpoint at the end of mitosis. The fusion of 
these proteins might provide a streamlined mechanism 
for cell cycle regulation during micronuclear mitosis, or 
may be involved somehow in the poorly understood
separation of acentromeric chromosomes during
amitosis.

Detection of Annotation Artifacts 

Of the 80 Tetrahymena genes identified by Gene deFu- 
ser, we believe that 52 are likely to represent actual 
fusions (Table 1). The majority of the remaining 28 
(Table 2) are most likely artifacts created by faulty
start/stop codon identification during gene model annotation.
When these genes are viewed in the genome browser at
the Tetrahymena Genome Database [39], most show
the two domains separated by an abnormally
long non-coding region, which we believe is an
intergenic region miscalled as an intron.

Earlier versions of the Tetrahymena genome are avail- 
able from the TIGR (now J. Craig Venter Institute) web- 
site, and Gene deFuser analyses of these proteins return 
different results. The initial annotation (v.2004)
contained 105 candidate gene fusions, compared with the
80 found in the current annotation (v.2008). Most of 
these genes (76) were present in both versions and were 
identified by Gene deFuser. Twenty-nine spurious 
fusions resulting from annotation artifacts were sepa- 
rated or eliminated from the annotation over the inter- 
vening period, whereas 4 new putative fusions 
(Genbank: EAS02650, EAS02286, EDK31800, 
EDK32083) were introduced into the annotation. 
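
The bookkeeping behind this comparison can be sketched as a simple set operation. The following is a minimal illustration, not part of Gene deFuser itself; the function name `compare_candidate_sets` and the toy identifiers are hypothetical, and the sketch assumes that gene identifiers remain stable between annotation versions.

```python
def compare_candidate_sets(old_ids, new_ids):
    """Split candidate fusion IDs into shared, removed, and introduced groups.

    Hypothetical helper: assumes an ID refers to the same gene model
    in both annotation versions.
    """
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        "shared": old_set & new_set,      # detected in both versions
        "removed": old_set - new_set,     # separated or eliminated later
        "introduced": new_set - old_set,  # new putative fusions
    }


# Counts reported in the text: 76 shared, 29 removed, 4 introduced.
SHARED, REMOVED, INTRODUCED = 76, 29, 4
assert SHARED + REMOVED == 105    # candidates in the v.2004 annotation
assert SHARED + INTRODUCED == 80  # candidates in the v.2008 annotation

if __name__ == "__main__":
    # Toy identifiers for illustration only, not real accessions.
    v2004 = ["toy_A", "toy_B", "toy_C"]
    v2008 = ["toy_B", "toy_C", "toy_D"]
    result = compare_candidate_sets(v2004, v2008)
    for group in ("shared", "removed", "introduced"):
        print(group, sorted(result[group]))
```

The two assertions simply confirm that the reported counts are internally consistent with the totals of 105 and 80 candidates.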

We believe that 25 of the 28 false positive gene 
fusions detected by Gene deFuser represent genes that 
have not yet been separated by annotators. Twenty-four 
of these 25 genes have an intron larger than 413 bp 
located between the two domains that comprise the 
putative fusion. The median intron length in Tetrahy- 
mena is 86 bp and only 8.3% of the introns in this spe- 
cies are larger than 400 bp (data not shown). That such 
large introns are located between the two domains sug- 
gests these introns were miscalled, resulting in the 
fusion of adjacent gene models. The presence of paired
EST reads matching only the 3' end of the remaining gene
(EAR99401) indicates that it too is an annotation
artifact. One of the three remaining candidate fusions is a
non-fused Ca2+/calmodulin-dependent kinase gene
present in many organisms, which Gene deFuser
misclassified as a candidate fusion based on hits to several
different kinase families (EAR82879). We judged the 
two remaining candidates to be false positives based on 
low BLAST scores (EAR96106 and EAR89363). Though 
these appear at first glance to be false positives, addi- 
tional data may prove several of these 28 genes to be 
genuine fusions. 
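
The intron-length criterion used above could be applied programmatically along the following lines. This is a hedged sketch rather than the procedure actually used in this study: the `Candidate` record, the function names, and the toy values are all hypothetical, and the 400 bp default is taken from the intron-length figure quoted in the text only as an illustrative cutoff that annotators would tune for their own genome.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate fusion."""
    gene_id: str
    inter_domain_intron_bp: int  # longest intron lying between the two domains


def fraction_longer_than(intron_lengths, threshold_bp):
    """Fraction of introns strictly longer than threshold_bp."""
    lengths = list(intron_lengths)
    if not lengths:
        raise ValueError("no intron lengths supplied")
    return sum(1 for n in lengths if n > threshold_bp) / len(lengths)


def flag_suspect_fusions(candidates, threshold_bp=400):
    """Return IDs whose inter-domain intron exceeds the cutoff.

    A long intron between the domains is treated as a hint that an
    intergenic region was miscalled as an intron; it is not proof.
    """
    return [c.gene_id for c in candidates
            if c.inter_domain_intron_bp > threshold_bp]


if __name__ == "__main__":
    # Toy values for illustration only, not measurements from this study.
    toy = [
        Candidate("toy_gene_1", 95),
        Candidate("toy_gene_2", 1250),
        Candidate("toy_gene_3", 400),
    ]
    print(flag_suspect_fusions(toy))
    print(fraction_longer_than([50, 86, 120, 900], 400))
```

Flagged genes would still need manual review, for example against EST evidence, since a single cutoff cannot distinguish a genuinely long intron from a miscalled intergenic region.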

The detection of these annotation artifacts highlights 
another possible use for Gene deFuser, as a tool to aid 
in the refinement of gene models during genome 
sequencing projects. Since a large portion (51%; 54/105) 
of the genes detected by the program in the preliminary 
annotation were gene model fusion artifacts, this tool 
could be used following the initial annotation of new 
genomes to identify some of the more obvious fusion 
artifacts. Gene deFuser can generate a list of putative 
fusions for annotators to evaluate using their own cri- 
teria, which are likely to differ based on the quality of 
the initial annotation and the uniformity found in the 
lengths of introns and intergenic regions. 

Conclusions 

Fused genes are a large untapped source of data for stu- 
dies of molecular evolution and protein function. The 
new program described in this paper promises to speed 
the identification of fusions in a wide variety of organ- 
isms, with the most interesting results likely to come 
from more evolutionarily diverse species. Our applica- 
tion of Gene deFuser to the Tetrahymena genome illus- 
trates the large number of new fusion genes waiting to 
be found in more exotic eukaryotic genomes. In this 
study alone we have identified new fusions involving a 
wide variety of proteins, including nucleases, proteases, 
motor proteins, and kinases. It is reasonable to expect 
an equally interesting collection of fusion genes in the 
genomes of other divergent eukaryotes. 

Additional material 



Additional file 1: Results of Gene deFuser for the Tetrahymena 
thermophila genome. This zip file contains the raw results of the 
analysis of the Tetrahymena genome using Gene deFuser. To view the 
contents, unzip the file and open the Final_Tet.html file in the resulting 
folder. 